Antipattern: Sort Data Randomly

Let's see the techniques used to sort the data that causes an antipattern.

We'll cover the following

The most common SQL trick to pick a random row from a query is to sort the query randomly and pick the first row. This technique is easy to understand and easy to implement:

Picking up a random row from the Bugs table using RAND()

Although this is a popular solution, it quickly shows its weakness. To understand this weakness, let’s first compare it to conventional sorting, in which we compare values in a column and order the rows according to the size of the value in that column. This kind of sorting is repeatable in that it produces the same results when we run it more than once. It also benefits from an index because an index is essentially a presorted set of the values from a given column.

Retrieving data from the Bugs table

Using a function that returns a random value per row makes it random whether a given row is greater or less than another row. So, the order has no relation to the values in each row. The order is also different each time we sort in this way. So far, so good — this is the result that we want.

Sorting by a non-deterministic expression (RAND()) means that the sorting cannot benefit from an index. There is no index containing the values returned by the random function. That’s the point of their being random: they are different and unpredictable each time they’re selected.

Table scan#

The sort-by-random technique is a problem for the performance of the query because using an index is one of the best ways to speed up sorting. The consequence of not using an index is that the query result set must be sorted by the database “manually.” This is called a table scan, and it often involves saving the entire result as a temporary table and sorting it by physically swapping rows. A table scan sort is much slower than an index-assisted sort, and the performance difference grows with the size of the data set.

Another weakness of the sort-by-random technique is that after the expensive process of sorting the entire data set, most of that work is wasted because all but the first row is immediately discarded. In a table with a thousand rows, why go to the trouble of randomizing all thousand when all we need is one row?

Both of these problems are unnoticeable when we run the query over a small number of rows, so during development and testing, it may appear to be a good solution. But as the volume in our database increases over time, the query fails to scale well.

Synopsis: Random Selection
Solution: In No Particular Order
Loginto save progress
Report an Issue